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Abstract — We present a classification-based approach to iden- 
tify quasi-stellar radio sources (quasars) in the Sloan Digital 
Sky Survey and evaluate its performance on a manually labeled 
training set. While reasonable results can already be obtained via 
approaches working only on photometric data, our experiments 
indicate that simple but problem-specific features extracted from 
spectroscopic data can significantly improve the classification 
performance. Since our approach works orthogonal to exist- 
ing classification schemes used for building the spectroscopic 
catalogs, our classification results are well suited for a mutual 
assessment of the approaches' accuracies. 

Index Terms — classification, astronomy, feature extraction 

I. Introduction 

The automated analysis of data sets has become an increas- 
ingly important issue for researchers in astronomy (2j, (4). 
This is in particular the case for massive data sets obtained 
from, e.g., the Sloan Digital Sky Survey (SDSS), which is said 
to be "one of the most ambitious and influential surveys in 
the history of astronomy" |T2| . The latter catalog is currently 
based on raw data of about 60 terabytes and the trend towards 
data-intensive science seems to become even more evident 
with near- future projects such as the Large Synoptic Sky 
Telescope j9j, which will produce data volumes in the petabyte 
range. From a machine learning perspective, a variety of 
problems in astronomy can be formulated as supervised (e.g. 
classification, regression) or unsupervised tasks (e.g. cluster- 
ing, dimensionality reduction), and the corresponding tools for 
addressing these tasks have been recognized as "increasingly 
essential in the era of data-intensive astronomy" (4). 

We describe the use of supervised learning techniques to 
discriminate quasars from other celestial objects. We approach 
this problem from a machine learning point of view and 
analyze how the generalization performance of standard clas- 
sifiers can be improved by incorporating simple problem- 
specific features which are motivated by the physical prop- 
erties of quasars and other celestial objects. Although some 
classification problems have already been addressed in the 
field of astronomy, the interface between both communities 
is not well defined |4|. One of the goals of this paper is to 
work towards bridging the gap between both communities by 



describing an important astronomical classification problem 
(along with some physical background) and by preparing the 
corresponding data such it can be approached easily from a 
machine learning point of view. 

II. Background 

One of the main objectives of the SDSS consisted in finding 
quasi-stellar radio sources (quasars), sl special kind of Active 
Galactic Nuclei (AGN). Due to their extreme luminosities, 
quasars are among the most distant objects in the universe 
that can be observed. Depending on this distance it takes up 
to billions of years for the emitted radiation to reach the Earth. 
Therefore this radiation reveals information about the long ago 
state of a quasar and thus about the early universe. 

The widely accepted explanation for the nature of these 
extreme sources that reach the highest luminosities among 
all known astrophysical phenomena is the so-called unified 
model 1 15 ]. This model reduces the large number of different 
types of AGN to a single phenomenon: a Supermassive Black 
Hole (SMBH) surrounded by a thick dust torus that is accreting 
material. It is heated to extremely high temperatures (3J and 
thus produces radiation in nearly all frequency bands. The 
differences in how we observe AGNs is mainly caused by 
the inclination of the surrounding dust torus. If the inclination 
angle is small, we are able to observe the inner parts of the 
torus. In this case either an optically bright quasar or even a 
highly variable blazar is detected. The broad emission lines in 
the spectrum of a quasar (see below) result from the fact that, 
while observing the direct vicinity of the SMBH, one observes 
a region with a higher gravitational potential and thus higher 
orbiting velocities of the accretion disk material. Such high 
velocities yield to a Doppler-broadening of the emission lines, 
in this strength only present for AGN-powered sources. 

a) Related Work: The problem of identifying quasars has 
already been addressed in the field of astronomy with a focus 
on photometric data, see, e.g., the overview given by Ball and 
Brunner (2j and the references therein. To the best of our 
knowledge, the work reported upon in this paper is the first 
extensive machine-learning-based study on classifying quasars 
given spectroscopic data. 



III. Data 

Our classification approaches work on a subset of the 
SDSS (DR6) |T2| database. This data has been obtained via 
a 2.5-meter telescope at the Apache Point Observatory (New 
Mexico) which is equipped with two special-purpose instru- 
ments: a 120-megapixel camera and a pair of spectrographs. 
Both types of resulting data, namely the photometric as well 
as the spectroscopic data, are used in this work. 

A. Labels 

To obtain ground truth data, we asked an expert to manually 
label (based on normalized plots) a random subset of 5, 261 
spectra out of the 1,271,680 spectra available in the SDSS. 
The resulting catalog contains IDs and labels of 512 objects 
of type "quasar" and 4, 749 objects of type "other"^] 

B. Photometric Data 

The 120-megapixel camera collects simultaneously data 
through five different filters. By doing drift- scanning, those 
data sets are combined to overlaying stripes that cover five 
wavelength ranges per observation. These wavelength ranges 
are the so-called u, g, r, i, and z bands. In each of 
these bands, the SDSS photometric pipeline automatically 
extracts data for all celestial objects. Finally, in addition to the 
segmented raw data stripes, a batch of extracted photometric 
features is available for each object in each band. 

One group of these photometric features, the so-called 
magnitudes, are logarithmic measures of the brightness of an 
object. Among the most established approaches for data fitting 
are the PSF, the Petrosian, and the Model approach p2) . 
The output of these approaches can be retrieved from 
the PhotoOb jAll table of the Catalog Archive Server 
(CAS) |T2) and, since extracted from five bands, yields the 
following set of photometric features: 

1) PSF magnitudes: psfMag_u, psfMag_g, 
psfMag_r, psfMag_i, psfMag_z 

2) Petrosian magnitudes: petroMag_u, petroMag_g, 
petroMag_r, petroMag_i, petroMag_z 

3) Model magnitudes: modelMag_u, modelMag_g, 
modelMag_r, modelMag_i, modelMag_z 

C. Spectroscopic Data 

The raw spectra of all manually labeled objects have been 
retrieved from the Data Archive Server (DAS) 1 12 ]. Each spec- 
trum contains the flux values for roughly 3, 850 wavelengths. 
In order to obtain raw features that are independent of the 
specific wavelengths, the spectra of all considered objects have 
been aligned and truncated. This results in a set of spectra 
having d = 3, 825 features, see Figure [T] 

D. Quality of the Data Sample 

To estimate the quality of our data sample, we consider an 
important astronomical property, namely the redshift (value) 
of an object (12). In Figure [2] the distribution (based on the 

! The induced catalog along with the photometric and spectroscopic data 
can be obtained from the authors upon request. 




Fig. 1. Spectroscopic data for the manually labeled training set, which 
consists of objects of type "quasars" (blue) and of type "other" (red). 



z-values in the SDSS database) of all 512 quasars present in 
our data set is given. It can be seen that these objects cover 
a redshift range of up to z = 5 and that the majority has a 
low redshift (z < 2.3). The remaining objects are dominated 
by 4, 053 galaxies exhibiting a low redshift. 
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Fig. 2. Redshift distribution of quasars (left) and other objects (right). 

In Figure [3] the labels available in the SDSS database are 
compared with our manually obtained labels. Here, the blue 
plot corresponds to all 512 objects classified as objects of 
type "quasar" by our expert; the red plot corresponds to the 
remaining objects of type "other". The x-axis of both plots 
is based on the labels present in the SDSS database. It can 
be clearly seen that the labeling of our expert is in agreement 
with the one of the SDSS-pipeline|^] 
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Fig. 3. Comparison of our expert's manual labeling with the corresponding 
labeling of the SDSS database. 



IV. Classification Approaches 

The labeled data induces a (training) set T = 
{(xi,2/i),...,(x n ,2/ n )} C R d x {-1,+1} containing the 

2 It should be pointed out that the SDSS-pipeline is also based on human 
verification of the labels and, hence, can only be seen as a semi-automatic 
classification approach. 



features and labels for each object, where the positive class 
corresponds to all objects of type "quasar". We apply two well- 
known classification approaches, namely k-Nearest Neighbors 
(kNNs) and Support Vector Machines (SVMs). For complete- 
ness, we depict the key ideas of both classification models and 
refer to the corresponding literature for details p3| , (14), (16). 

A. k-Nearest Neighbors 

The /c-Nearest Neighbor classifier uses the k "closest" 
objects from the given set of already classified objects to 
assign a class to an unclassified object |T4) . More precisely, 
the (binary) classification Y"(x) for an object x is 



F(x) 



1 if/(x)>0 
-1 if/(x)<0, 



where 



/(x) 



(1) 



(2) 



Xi67V fc (x) 



and where iVfc(x) denotes the /c-nearest neighbors in the 
training set with respect to x. To define "closeness", arbitrary 
metrics can be used; a popular choice is the Euclidean metric 
(which we use for the experimental evaluation as well). 

B. Support Vector Machines 

The aim of a SVM consists in finding a hyperplane in a 
feature space which maximizes the "margin" between both 
classes such that only few training patterns lie within the 
margin (13), (16). The latter task can be formulated as an 
optimization problem, where the first term corresponds to 
maximizing the margin and the second term to the loss caused 
by patterns lying within the margin: 



minimize -||w|| 2 



s.t. ^((w,$(x,))+&) > 
and & > 0, 



(3) 



where C > is a user-defined parameter. The function 
: R d — » mapping the patterns into a feature space 
Ho is induced by a kernel function k : R d x R d — >> R 
with fc(x^Xj) = ($(xi), $(xj)). A kernel function can be 
seen as a "similarity measure" for input patterns. The goal 
of the learning process is to find the optimal hyperplane 
/(x) = (w,<£(x)) + b. Unseen objects can subsequently be 
classified via Equation ([I]). A common choice for the kernel 
function is the linear kernel 



k(^i i Xj ) (x^ , Xj ) 



or the RBF kernel 



fc(x£, Xj) = exp 



2a 2 



(4) 



(5) 



V. Spectroscopic Feature Extraction 

As discussed in Section [TTJ the physical properties of quasars 
result in characteristic, broad emission lines. Our approach 
consists in extracting meaningful features given "continuum- 
substracted" versions of the raw spectra. In the remainder of 
this section, we provide two ways for estimating the continuum 
(i.e. the rough shape) of a spectrum. Given such an estimate, 
we define several features motivated by the physical properties 
of quasars and non-quasars. 

A. Continuum Extraction with Splines 

The first approach for extracting the continuum is based on 
splines [6]. In a first step, we use a sliding-window technique 
to merge consecutive features (flux values) and to obtain 
a "binned" version of the spectroscopic data. This binned 
spectrum is subsequently normalized and smoothed using the 
Savitzky-Golay-Filter (TT) (which is a standard smoothing 
filter), see the black curve shown in Figure [4] Since the 
broad emission lines present in a quasar's spectroscopic data 
lead to an overestimation of the continuum, we apply an 
additional smoothing method to reduce the influence of the 
peak intensities. Given such a smoothed spectrum, we perform 
a spline interpolation to extract the estimated continuum (red 
curve). Finally, the difference of the smoothed spectrum and 
the estimated continuum is evaluated by rounding "nonrele- 
vant" values within a 5-range around the continuum, where 
S is computed as the standard deviation of all calculated 
differences. The resulting "continuum- subtracted" spectrum is 
represented by the green curve in Figure [4] 

B. Continuum Extraction with Support Vector Regression 

Our second approach for estimating the continuum is based 
on the concept of support vector regression (SVR) [13 ] which 
can be considered to be a special case of regularization 
problems of the form 



1 n 



M\f\\ 2 n, 



(6) 



where A > is a fixed real number, L : R x R — ^ [0, oo) is a 



where the parameter a is called the kernel width. 



loss function and \\f\\ n is the squared norm in a reproducing 
kernel Hilbert space (RKHS) H C R x = {f : X -> R} 
induced by the associated kernel function (note that, in this 
case, the kernel function is defined on the indices associated 
with the flux vales). By plugging in the so-called e-intensive 
loss with e > one obtains 

inf I^ ma x(|/(x i )-t/ i |-e,0)+A||/||^ (7) 

/ G H ft . 

which depicts the optimization problem of SVR. Here, the first 
term corresponds to the "difference" between the (continuum) 
model and the data and the second term corresponds to the 
"complexity" of the (continuum) model. Ideally, one would 
like to have a model which fits the data well and which is 
not too "complex". Additionally (and in contrast to standard 
regression problems), we would like to "ignore" the broad 




emission lines present in the quasars' spectra, see Figure [4] 
Two properties of the SVR approach are advantageous in 
this context: First, the ^-intensive loss function penalizes the 
differences of the model and the flux values only linearly (in 
contrast to, e.g., the square-loss (13)). Second, by adapting the 
involved parameters, one can adjust the behavior of the model 
such that the influence of large peaks is reduced]^] 

Given the resulting continuum models, we compute the 
continuum substracted spectra by considering the difference 
between each continuum model and the corresponding input 
spectrum while ignoring values lying within a small J-range 
around the continuum model (like above). 



TABLE I 

Extracted Features 



Feature 


Description 


Fl 


First value of the extracted 
continuum 


F2 


Last value of the extracted 
continuum 


F3 


Integral of all positive peaks 


F4 


Integral of all negative peaks 


F5 


Width of the broadest positive peak 


F6 


Width of the broadest negative peak 


F7 


Major peak intensity 


F8 


Minor peak intensity 


F9 


Major face of positive peak 


F10 


Major face of negative peak 



C. Extracted Features 

Given the estimated continuum and thus the continuum- 
substracted spectra, we try to extract meaningful features 
representing the main characteristics of typical quasars, see 
Table |TJ Since the (true) continuum of a quasar's spectrum is 
normally horizontal or decreasing, we use the first and the last 
value of the computed continuum as features to estimate the 
rough slope of the spectrum (Fl and F2). As described in Sec- 
tion broad emission lines provide an excellent characteristic 
for experts to classify quasars. The remaining eight features 
aim at capturing these "peak properties". Here, features F3 
and F4 denote the absolute values of the sum of all "positive" 
and "negative" peaks, respectively. Further, features F5, F6, 
F9, and F10 indicate the width and the face of the strongest 
peaks. Finally, features F7 and F8 capture the minimum and 
maximum peak intensities. 

VI. Experiments 

We consider several data sets which are based on the 
photometric and spectroscopic data in order to evaluate the 
generalization performances of our approaches. In the remain- 
der of this section, we provide a description of these data sets 
as well as a description of the experimental setup and the final 
outcome of the experimental evaluation. 

3 For instance, the kernel width a of the (used) RBF kernel can be adjusted 
such that large peaks have less influence on the continuum model. 



A. Data Sets 

Given the data, we generate a variety of data sets, see 
Table [TTJ Each data set contains all N = 5, 261 objects 
and, thus, all p = 512 objects of type "quasar" and all 
n = 4, 749 objects of type "other". The first four data 
sets D1-D4 are based on the different magnitudes which 



are retrieved from the photometric data, see Section III The 
data sets D5-D7 are binned versions of the raw spectra, 
i.e., the original input dimension (3,825) of each spectrum 
is reduced to d = 5,<i = 100 and d = 500 respectively 
by "averaging" consecutive flux values such that the desired 
output dimension is obtained. The data sets DS and D9 are 
obtained via the spectroscopic feature extraction approaches 
depicted in Section [Vj where we use binned versions of the 
spectra with d = 500 dimensions as starting point. The last 
data set D10 is based on the spectroscopic feature extraction 
(SVR) in combination with a set of photometric features]^] 

B. Experimental Setup 

Except for the SVM implementation, all preprocessing steps 
and algorithms are implemented in Python. For the SVM 
model, we resort to the LIBSVM implementation provided by 
Chang and Lin [5]. 

4 This data set aims at the situation, where both parts of the data (photo- 
metric and spectroscopic) are given for all objects (which is the case for, e.g., 
the LAMOST project (5)). 



TABLE II 
Considered Data Sets 



TABLE III 
Classification Performances 



Data Set 


Features 


Dl 


psfMag_u - psfMag_g, psfMag_g - psfMag_r, 
psfMag_r - psfMag_i, psfMag_i - psfMag_z 


D2 


psfMag_u, psfMag_g, psfMag_r, psfMag_i, 
psfMag_z 


D3 


psfMag_u, psfMag_g, psfMag_r, psfMag_i, 
psfMag_z, modelMag_u, modelMag_g, 
modelMag_r, modelMag_i, modelMag_z 


D4 


psfMag_u, psfMag_g, psfMag_r, psfMag_i, 
psfMag_z, petroMag_u, petroMag_g, 
petroMag_r, petroMag_i, petroMag_z 


D5 


BinnedSpec5 


D6 


BinnedSpeclOO 


D7 


BinnedSpec500 


D8 


ExtractedFeatures (SPLINE) 


D9 


ExtractedFeatures (SVR) 


D10 


ExtractedFeatures (SVR), psfMag_u, 
psfMag_g, psfMag_r, psfMag_i, psfMag_z, 
modelMag_u, modelMag_g, modelMag_r, 
modelMag_i, modelMag_z 



b) Parameters: To train and evaluate the classification 
approaches, half of each data set is used as training and 
the other half as test set. For the SVM model, a RBF 
kernel with kernel width a is used. The corresponding model 
parameters fc, C, and a for both classification approaches are 
tuned via 10-fold cross-validation fl4| on the training set 
given a grid of parameters (k G {1, . . . , 10} for kNN and 
(C» G {2- 10 ,2- 9 ...,2+ 10 } x {2- 10 ,2- 9 ,...,2+ 10 } for 
the SVM). For the spectroscopic feature extraction, we set 
the involved parameters to fixed values for all spectra: For 
the spline-based approach, the binned spectrum is smoothed 
using the Savitzky-Golay-filter of degree 3 and window size 
15. Afterwards we use the interpolation package of the Scipy 
library [7 ] for estimating the smoothed spline of order 4 and 
smoothing factor 100. For the SVR-based approach, we again 
make use of the LIBSVM implementation using a RBF kernel 
and input parameters C = 100, gamma = 0.00001, and 
epsilon = 0.1. 

c) Measuring the Classification Performance: The per- 
formances of our models are evaluated on the test set. Since 
our data sets are imbalanced, we resort to the Matthews 
Correlation Coefficient (MCC) fT0| as quality measure: 

TP • TN - FP • FN 

y/( TP + FN)(TP + FP)(TN + FP)(TN + FN) 

Here, TP, FP, TN and FN denote the number of true 
positives, false positives, true negatives, and false negatives, 
respectively (TJ, flO) . We also provide the true positive rate 
(TP -rate) given by TP/p as well as the false positive rate 
(FP-rate) given by FP/n, which can be seen as "hit rate" 
and "false alarm rate", respectively (TJ. Finally, we consider 
the error on the test set given by (FP-\-FN) /N which simply 
measures the overall number of misclassifications. 

C Results 

In the remainder of this section, we discuss the outcome of 
the conducted experiments. 



Data 
Set 


kNN 


MCC 


Error 


TP - rate 


FP - rate 


Dl 


0.873 


2.32% 


86.1% 


1.0% 


D2 


0.826 


3.15% 


81.7% 


1.4% 


D3 


0.902 


1.82% 


91.2% 


1.0% 


D4 


0.893 


2.01% 


91.6% 


1.3% 


D5 


0.748 


4.41% 


70.6% 


1.5% 


D6 


0.824 


3.15% 


80.2% 


1.2% 


D7 


0.808 


3.42% 


77.7% 


1.2% 


D8 


0.963 


0.69% 


95.6% 


0.3% 


D9 


0.959 


0.76% 


96.7% 


0.5% 


D10 


0.971 


0.53% 


96.3% 


0.2% 




Data 
Set 


SVM (RBF) 


MCC 


Error 


TP — rate 


FP - rate 


Dl 


0.859 


2.54% 


84.2% 


1.0% 


D2 


0.851 


2.74% 


85.0% 


1.3% 


D3 


0.922 


1.44% 


91.9% 


0.7% 


D4 


0.908 


1.71% 


91.5% 


0.9% 


D5 


0.757 


4.33% 


73.6% 


1.8% 


D6 


0.788 


3.69% 


71.7% 


0.8% 


D7 


0.610 


6.23% 


44.0% 


0.5% 


D8 


0.965 


0.65% 


96.0% 


0.3% 


D9 


0.977 


0.42% 


97.4% 


0.2% 


D10 


0.971 


0.53% 


96.7% 


0.2% 



d) Classification Performance: The classification perfor- 
mances of both classification approaches on the various data 



sets are shown in Table III where the best results are high- 
lighted (bold face type). Both the kNN and the SVM model 
perform well on the first four data sets, which are composed 
of the magnitudes retrieved from the photometric data|^] While 
the performance is worse on the data sets containing the binned 
versions of the spectra (D5-D7), it improves significantly 
when the extracted features (D8 and D9) are used as input 
data. Using additional photometric features (D10) does not 
lead to an imrovement. 

e) How Much Spectroscopic Data is Needed: The spec- 
troscopic feature extraction approaches used for the data sets 
D8-D10 are based on binned versions of the raw spectra 
containing d = 500 values. A natural question is "how 
detailed" these binned versions have to be such that the 
spectroscopic feature extraction yields good results. One of 
the benefits of being able to use "low detailed" versions of 
the raw spectra is the performance gain with respect to the 
computation time (which will be of tremendous importance 
for an efficient classification of larger portions of the SDSS 
or even the full catalog). 

In Figure [5] the influence of the input dimension on the 
classification performance (MCC) is shown. The plot indicates 
a quite similar classification performance of both approaches 
on all considered data sets. Further, it seems that data sets 
based on input dimensions of roughly d = 200 already yield 

5 We would like to point out that a generalization of these results to the 
complete photometric catalog is dubious due to the fact that the sample is 
biased (a target selection criteria was used for collecting the spectroscopic 
data (H)). 
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Fig. 5. Influence of the input dimension on the MCC. 



good results, which cannot be improved significantly by using 
a higher input precision. We expect this observation to be 
helpful when considering a time/classification quality trade- 
off for processing massive databases, e.g. in the context of the 
LAMOST project (§. 

VII. Conclusions and Outlook 

We have considered the task of using classification al- 
gorithms to discriminate quasars from other objects given 
photometric and spectroscopic data. In recent years, problems 
of this kind have gained more and more attention in the field 
of astronomy due to the fact that the massive amounts of 
data arising nowadays cannot be handled without automatic 
(machine learning) approaches. While the problem of detecting 
quasars (especially using only photometric data) has already 
been considered in the field of astronomy, we are not aware 
of corresponding literature (with respect to spectroscopic data) 
in the field of machine learning, and the orthogonality of 
the existing and proposed approaches seems to allow for a 
mutually beneficial collaboration between these fields. The 
shear volume of astronomical data that is already available 
or will be available in the near future emphasizes the need for 
high-performance machine learning algorithms. 
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